Post-hoc analysis (from Latin post hoc, "after this"), in the context of design and analysis of experiments, refers to examining the data, after the experiment has concluded, for patterns that were not specified a priori. Critics sometimes call this data dredging, to evoke the sense that the more one looks, the more likely something will be found. More subtly, each time a pattern in the data is considered, a statistical test is effectively performed. This greatly inflates the total number of statistical tests and necessitates the use of multiple testing procedures to compensate. However, such corrections are difficult to apply precisely, and in practice most results of post-hoc analyses are reported with unadjusted p-values. These p-values must be interpreted in light of the fact that they are a small, selected subset of a potentially large group of p-values. Results of post-hoc analysis should be explicitly labeled as such in reports and publications to avoid misleading readers.
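Where a correction is applied, a common choice is a step-down adjustment such as Holm–Bonferroni. The sketch below is a minimal, self-contained Python illustration; the holm_adjust helper and the raw p-values are hypothetical, shown only to make the adjustment concrete.

```python
# Minimal sketch: Holm-Bonferroni step-down adjustment of post-hoc p-values.
# The p-values below are hypothetical, invented purely for illustration.

def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Step-down multiplier shrinks as rank increases: m, m-1, ..., 1.
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted

raw = [0.003, 0.012, 0.040, 0.210]   # hypothetical post-hoc p-values
print(holm_adjust(raw))              # roughly [0.012, 0.036, 0.08, 0.21]
```

Only the smallest p-values survive the adjustment, which is exactly the behaviour one wants when many patterns were inspected before testing.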
In practice, post-hoc analyses are usually concerned with finding patterns or relationships between subgroups of sampled populations that would otherwise remain undetected if the scientific community relied strictly upon a priori statistical methods. Post-hoc tests, also known as a posteriori tests, greatly expand the range of methods that can be applied in exploratory research. Post-hoc examination strengthens induction by limiting the probability that significant effects will appear to have been discovered between subgroups of a population when none actually exist. Even so, many scientific papers are published without adequate, preventive control of the Type I error rate.[1]
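The inflation of the Type I error rate is easy to demonstrate by simulation. The following sketch is a hypothetical Monte Carlo example, not taken from the cited source: it draws several subgroups from one identical population and counts how often at least one unadjusted pairwise t-test comes out "significant".

```python
# Hypothetical Monte Carlo sketch: familywise error rate of unadjusted
# pairwise t-tests among subgroups drawn from one identical population.
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_groups, n_per_group, alpha, n_sims = 6, 20, 0.05, 2000

false_alarms = 0
for _ in range(n_sims):
    groups = [rng.normal(0.0, 1.0, n_per_group) for _ in range(n_groups)]
    # Any "significant" pairwise difference here is by construction a false positive.
    if any(stats.ttest_ind(a, b).pvalue < alpha
           for a, b in itertools.combinations(groups, 2)):
        false_alarms += 1

# With 15 unadjusted comparisons, the familywise rate lands far above 0.05.
print(f"Estimated familywise Type I error rate: {false_alarms / n_sims:.2f}")
```

Even though every subgroup comes from the same population, the chance of declaring at least one spurious difference is several times the nominal 5% level.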
Post-hoc analysis is an important procedure without which multivariate hypothesis testing would greatly suffer, because the chance of reporting false positives would be unacceptably high. Ultimately, post-hoc testing creates better-informed scientists who can in turn formulate better, more efficient a priori hypotheses and research designs.
The Student–Newman–Keuls and related tests are often referred to as post hoc. However, an experimenter often plans to test all pairwise comparisons before seeing the data, so these tests are better categorized as a priori.
An example of an analysis often mislabeled as a post-hoc analysis is the Newman–Keuls method: "A different approach to evaluating a posteriori pairwise comparisons stems from the work of Student (1927),[2] Newman (1939),[3] and Keuls (1952).[4] The Newman–Keuls procedure is based on a stepwise or layer approach to significance testing. Sample means are ordered from the smallest to the largest. The largest difference, which involves means that are r = p steps apart, is tested first at the α level of significance; if significant, means that are r = p − 1 steps apart are tested at the α level of significance, and so on. The Newman–Keuls procedure provides an r-mean significance level equal to α for each group of r ordered means; that is, the probability of falsely rejecting the hypothesis that all means in an ordered group are equal is α. It follows that the concept of error rate applies neither on an experimentwise nor on a per-comparison basis: the actual error rate falls somewhere between the two. The Newman–Keuls procedure, like Tukey's procedure, requires equal sample n's.
The critical difference $\hat{\psi}_r$ that two means separated by r steps must exceed to be declared significant is, according to the Newman–Keuls procedure,

$$\hat{\psi}_r = q_{\alpha;\,r,\nu}\,\sqrt{\frac{MS_{\text{error}}}{n}},$$

where $q_{\alpha;\,r,\nu}$ is the upper-α point of the Studentized range distribution for r ordered means and ν error degrees of freedom, $MS_{\text{error}}$ is the within-group (error) mean square from the ANOVA, and n is the common sample size per group.
The Newman–Keuls and Tukey procedures require the same critical difference for the first comparison that is tested. The Tukey procedure uses this critical difference for all the remaining tests, whereas the Newman–Keuls procedure reduces the size of the critical difference, depending on the number of steps separating the ordered means. As a result, the Newman–Keuls test is more powerful than Tukey's test. Remember, however, that the Newman–Keuls procedure does not control the experimentwise error rate at α.
Frequently a test of the overall null hypothesis $\mu_1 = \mu_2 = \cdots = \mu_p$ is performed with an F statistic in ANOVA rather than with a range statistic. If the F statistic is significant, Shaffer (1979) recommends using the critical difference $\hat{\psi}_{p-1}$ instead of $\hat{\psi}_p$ to evaluate the largest pairwise comparison at the first step of the testing procedure. The testing procedure for all subsequent steps is unchanged. She has shown that the modified procedure leads to greater power at the first step without affecting control of the Type I error rate. This makes dissonances, in which the overall null hypothesis is rejected by an F test without rejecting any one of the proper subsets of comparisons, less likely."
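As a rough illustration of the stepwise logic quoted above, the sketch below implements a Newman–Keuls style procedure for equal group sizes in Python, using SciPy's studentized_range distribution for the critical values. The newman_keuls helper, the example data, and the optional Shaffer-style first-step switch are assumptions made for this example, not part of the quoted source.

```python
# Minimal sketch of the Newman-Keuls step-down procedure (equal n's assumed).
# The data and alpha below are hypothetical; Shaffer's first-step modification is optional.
import numpy as np
from scipy.stats import studentized_range

def newman_keuls(groups, alpha=0.05, shaffer_first_step=False):
    """Test all pairwise mean differences with the Newman-Keuls step-down rule."""
    p = len(groups)                       # number of groups
    n = len(groups[0])                    # common group size (assumed equal)
    means = np.array([np.mean(g) for g in groups])
    # Pooled within-group (error) mean square and its degrees of freedom.
    df_error = p * (n - 1)
    ms_error = sum(np.sum((np.asarray(g) - np.mean(g)) ** 2) for g in groups) / df_error
    se = np.sqrt(ms_error / n)

    order = np.argsort(means)             # means ordered from smallest to largest
    ordered_means = means[order]
    not_significant = []                  # position ranges already declared non-significant
    results = {}
    for span in range(p, 1, -1):          # r = p, p-1, ..., 2 steps apart
        # Shaffer (1979): after a significant omnibus F test, the first step
        # may use the critical value for p-1 means instead of p.
        r_crit = span - 1 if (shaffer_first_step and span == p) else span
        critical_difference = studentized_range.ppf(1 - alpha, r_crit, df_error) * se
        for i in range(p - span + 1):
            j = i + span - 1
            # Step-down rule: a range nested inside a non-significant range
            # is accepted (declared non-significant) without testing.
            if any(a <= i and j <= b for a, b in not_significant):
                results[(order[i], order[j])] = False
                continue
            significant = ordered_means[j] - ordered_means[i] > critical_difference
            if not significant:
                not_significant.append((i, j))
            results[(order[i], order[j])] = significant
    return results

rng = np.random.default_rng(1)
data = [rng.normal(mu, 1.0, 10) for mu in (0.0, 0.2, 1.0)]   # hypothetical groups
for (lo, hi), sig in sorted(newman_keuls(data).items()):
    print(f"group {lo} vs group {hi}: significant={sig}")
```

The nesting rule is what makes this a step-down procedure rather than a set of independent range tests: once a wide range of ordered means is declared non-significant, every comparison contained within it is accepted without further testing.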